A blind image super-resolution network guided by kernel estimation and structural prior knowledge

The goal of blind image super-resolution (BISR) is to recover the corresponding high-resolution image from a given low-resolution image with unknown degradation. Prior research has primarily focused on utilizing the kernel as prior knowledge to recover the high-frequency components of the image. However, these methods overlook the role of structural prior information within the same image, which results in unsatisfactory recovery of textures with strong self-similarity. To address this issue, we propose a two-stage blind super-resolution network that is based on a kernel estimation strategy and is capable of integrating structural texture as prior knowledge. In the first stage, we utilize a dynamic kernel estimator to achieve degradation representation embedding. Then, we propose triple path attention groups, each consisting of triple path attention blocks and a global feature fusion block, to extract structural prior information that assists the recovery of details within images. The quantitative and qualitative results on standard benchmarks with various degradation settings, including Gaussian8 and DIV2KRK, validate that our proposed method outperforms the state-of-the-art methods in terms of fidelity and recovery of clear details. The relevant code is made available at this link as open source.

The task of image super-resolution (SR) is to reconstruct clear high-resolution images from low-resolution images. Image degradation is often considered the inverse problem of SR, as it involves mathematically modeling the processes that deteriorate image quality. According to previous works [1][2][3][4][5], the degradation pipeline is typically modeled as Eq. (1):

$$y = (x * k_h)\downarrow_s + n \qquad (1)$$

where x represents the high-resolution (HR) image and y the corresponding low-resolution (LR) image. The operator * denotes two-dimensional convolution, k_h is the Gaussian kernel, ↓_s denotes downsampling with a scale factor of s, and n refers to additive white Gaussian noise (AWGN). Classical SR methods [6][7][8] assume that the degradation pipeline is a single bicubic downsampling. However, if the predefined degradation does not exactly match the practical situation, the reconstructed HR image may exhibit unpleasant artifacts 1. Therefore, recovering sharp edges and rich details from LR images with unknown degradation 1,2,5,9-12 is an extremely meaningful and challenging task.
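The degradation pipeline described above can be sketched in a few lines of NumPy, assuming the commonly used form y = (x * k)↓s + n for Eq. (1). The function names, the reflect padding, and the strided (direct) downsampling are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Isotropic Gaussian blur kernel, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size 2D convolution of a single-channel image (reflect padding)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="reflect")
    flipped = kernel[::-1, ::-1]  # flip so this is a convolution, not a correlation
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def degrade(x, kernel, scale, noise_sigma=0.0, rng=None):
    """Blur, downsample by `scale`, then add white Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    blurred = blur(x, kernel)
    lr = blurred[::scale, ::scale]  # assumption: direct (strided) downsampling
    return lr + rng.normal(0.0, noise_sigma, size=lr.shape)
```

For example, `degrade(x, gaussian_kernel(21, 1.8), scale=4)` turns a 32 × 32 single-channel image into an 8 × 8 one; a non-blind method that assumes bicubic downsampling would see a different degradation than the one actually applied here.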
The most common blind SR schemes are typically divided into two stages: the first stage models the kernel explicitly or implicitly by optimizing a deep neural network on the degraded image [1][2][3][4][5],9, and the second stage feeds the LR image, combined with the additional degradation prior, through the SR network to obtain the reconstructed HR image. In the first stage, a mismatch between the estimated blur kernel and the actual one can lead to over-smoothed or over-sharpened results [1][2][3]. An available solution is to perform accurate estimation of the kernel 1,9 and robust integration with the SR backbone 2,3,5.

Related work
SR of bicubic and multiple degradation
The pioneering work of SRCNN 6 has motivated broad interest among researchers in the field of SR. Inspired by hierarchical architectures 7,8,17 and robust loss functions 11,12,[18][19][20][21], CNN-based methods have achieved outstanding performance on predefined bicubic downsampling in the SR task, whereas the degradation process in the real world is generally unknown and complicated 11,12. In practical applications, if the bicubic kernel assumed by classical methods does not match the actual degradation kernel, unpleasant artifacts appear in the reconstructed SR image, severely affecting perceptual quality. This discrepancy between the assumed kernel and the actual kernel gives rise to a domain gap [22][23][24], which is a challenge in practical applications of SR.
Another line of non-blind SR methods 4,[25][26][27][28] is designed to super-resolve multiple types of degraded images with the corresponding kernels. These methods make classical SR networks more robust and applicable to a wider range of real-world scenarios. FFDNet 25 utilizes a noise level map as additional input, allowing it to handle various noisy images affected by different types of degradation. Similarly, SRMD 4 proposes a kernel stretching strategy that incorporates the two degradation parameters, the blur kernel k and the noise level n, together with the LR image as input to the SR network. Zhang et al. 29 combine learning-based methods with model-based methods to design an end-to-end unfolding network that can handle various types of degraded images at different scales. UDVD 27 introduces dynamic convolution in the kernel estimation network, where the parameters of the filters are dynamically adjusted according to the input degraded image. KMSR 26 utilizes generative adversarial networks to learn the distribution of kernels in real degraded images. Inspired by KMSR 26, Son et al. 28 propose an adaptive downsampling model that employs an unsupervised approach to simulate the actual degradation process of real-world images. They then synthesize paired data and develop an SR network capable of handling various types of degradation.
Figure 1. Blind super-resolution of Img100 from DIV2KRK 9, for scale factor 4. Based on the fusion of local and global features, our method is effective in restoring sharp and clean edges, and outperforms previous state-of-the-art approaches such as ZSSR 13, IKC 1, AdaTarget 14, DANv2 2, and DCLS 3.

SR of unknown kernel
The most common approach to the blind SR task is based on kernel estimation 1-5,9,30. KernelGAN 9 utilizes cross-scale image similarity to estimate the kernel of a specific image and combines it with a classical method 13 to achieve blind reconstruction. MANet 30 further investigates spatially variant blur kernels in order to handle object motion and out-of-focus blur in real-world scenarios. Gu et al. 1 use an iterative correction method to alleviate the effects caused by the mismatch between the estimated kernel and the practical kernel. Luo et al. 2,5 adopt an end-to-end network to alternately optimize the estimator and the restorer. These two methods 1,2 are effective but time-consuming owing to their elaborate optimization steps. DCLS 3 reformulates a practical degradation model and proposes a deep constrained least squares module that performs deconvolution in order to achieve robust degradation awareness. The aforementioned methods [1][2][3]5,9,22,23 concentrate on modeling degradation either implicitly 22,23,31 or explicitly [1][2][3][4][5]9,10,32, without delving into the role of structural textures as prior knowledge. This may be a factor that limits the upper bound of blind SR performance.

Method
Architecture
In this subsection, we introduce the overall architecture of our model. As shown in Fig. 2, our method mainly contains two stages: degradation representation embedding and texture details recovery. The first stage includes dynamic kernel estimation and a deblurring operation based on the DCLS 3 module. The estimator N_e accomplishes robust kernel estimation from the degraded LR image. Next, the LR image and the estimated blur kernel k are jointly input into the DCLS module for deblurring. Lastly, the clean features and the original shallow features are fed into the triple path attention network, which consists of triple path attention blocks (TPAB) and global texture fusion blocks (GTFB), to achieve the fusion of local and global features. Details on the pipeline of our method and the relevant blocks are described in the following subsections.

Degradation representation embedding
Inspired by the work of 3, our method employs dynamic kernel estimation, as shown in Fig. 3. Given an LR image with unknown degradation as input, three residual blocks are applied to extract deep features f_s, followed by global average pooling to obtain their flattened form. A fully connected layer maps the specific degradation information to four filters, h_0, h_1, h_2, and h_3, with kernel sizes set to 11 × 11, 7 × 7, 5 × 5 and 1 × 1, respectively, so that the receptive field is consistent with the size of the predicted kernel k. The process of dynamic estimation is shown in Eq. (2):

$$k = I_k * h_0 * h_1 * h_2 * h_3 \qquad (2)$$

where I_k is the identity kernel, h_0, h_1, h_2, and h_3 are specific filters mapped from the degradation information, and k is the kernel estimated by the estimator N_e. The identity kernel I_k is sequentially convolved with these filters, enabling the parameters in network N_e to vary with different degraded inputs. Meanwhile, the DCLS 3 module utilizes deconvolutional operations to obtain clean features as in Eq. (3),
where f_o represents the blurry original features extracted from the LR image by a 3 × 3 convolution layer and three residual blocks, k is the kernel predicted by the network N_e, and f_c represents the deblurred clean features obtained through the deconvolutional operation of the DCLS 3 module.
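The sequential convolution of the identity kernel with the four predicted filters can be illustrated as follows. This is a sketch under stated assumptions: the filters here are random, normalized stand-ins (in the method they are produced by the fully connected layer of the estimator), "full" convolution is assumed, and the helper names are hypothetical.

```python
import numpy as np

def conv2d_full(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Full 2D convolution; output size is (Ha + Hb - 1, Wa + Wb - 1)."""
    ha, wa = a.shape
    hb, wb = b.shape
    out = np.zeros((ha + hb - 1, wa + wb - 1))
    for i in range(ha):
        for j in range(wa):
            out[i:i + hb, j:j + wb] += a[i, j] * b
    return out

def compose_kernel(filters) -> np.ndarray:
    """Convolve a 1 x 1 identity kernel with each filter in turn."""
    k = np.ones((1, 1))  # identity kernel I_k
    for h in filters:
        k = conv2d_full(k, h)
    return k

def random_normalized_filter(size: int, rng) -> np.ndarray:
    """Stand-in for a predicted filter: positive entries that sum to 1."""
    w = rng.random((size, size))
    return w / w.sum()

rng = np.random.default_rng(0)
filters = [random_normalized_filter(s, rng) for s in (11, 7, 5, 1)]
kernel = compose_kernel(filters)
```

Under full convolution, filter sizes 11, 7, 5 and 1 compose into a 21 × 21 kernel (1 + 10 + 6 + 4 + 0), and because each stand-in filter sums to one, the composed kernel does as well.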

Texture details recovery
Even with the deconvolutional operation of the DCLS 3 module, the damaged high-frequency information cannot be fully restored. Therefore, we propose a novel network that not only extracts strong local features to compensate for the decline of high-frequency components but also incorporates non-local 15,16 operations to fuse the local and global features.
Figure 2 illustrates the proposed SR network, which mainly consists of the extraction of original features and the fusion of local features with global features. A 3 × 3 convolutional layer and three residual blocks without batch normalization 33 are used to extract the original features f_o as in Eq. (4),
where I_LR ∈ R^{H×W×C} is the input LR image, H and W represent the height and width of the patch cropped from a sub-image, and C is the number of RGB channels in the image.
In the previous stage we obtained the clean features f_c. FAIG 34 demonstrates that a one-branch network without degradation prior can achieve performance comparable to the two-branch method with degradation information. Although it may be reasonable to directly use the clean feature f_c as input to the SR network for recovery, the offset of kernel estimation 9,30 and the insufficient deblurring capability of the DCLS 3 module would prevent the SR backbone from effectively restoring highly structured textures. Therefore, we propose a Triple Path Attention Group (TPAG) to extract the deep feature f as in Eq. (6),
where ψ(·) represents a TPAG that takes the clean feature f_c and the two chunks of the original feature f_o as inputs; h_GTFB(h^n_TPAB) means that the group is composed of n Triple Path Attention Blocks (TPAB) and one Global Texture Fusion Block (GTFB); f is the deep clean feature, and N is the number of TPAGs in our SR network.
In addition, we further refine the deep feature f through a 3 × 3 convolutional layer, with the original low-frequency feature f_o connected through long skip connections 7,8,35,36, as in Eq. (7).
Finally, pixel shuffle 37 serves as the upsampling module and completes the mapping from feature maps to the HR image I_SR.
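The pixel shuffle upsampling step mentioned above rearranges channels into space. A minimal NumPy sketch follows, assuming a channel-first (C·r², H, W) layout with the same channel ordering that common deep learning frameworks use; the function name is illustrative.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C * r * r, H, W) feature map into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)
```

For a scale factor of 4, the last convolution before this step would need to output 3 · 16 = 48 channels to produce a three-channel image at four times the spatial resolution.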

Triple path attention block
Deep SR networks contain specific filters that can handle various types and levels of degraded images 34. These filters, which can address corresponding degradations such as noise and blur, are located at different positions and branches within a single SR network. Channel attention 8,36,38,39 and spatial attention 40,41 mechanisms can enhance the local modeling ability. Therefore, we introduce these mechanisms as two branches in TPAB, allowing the network to strengthen its generalization and better handle different types of degradation.
The triple path attention blocks, consisting of residual channel attention and residual local spatial blocks, are shown in Fig. 2. The original shallow features f_o are split into two feature maps along the channel dimension. They are combined with the deblurred clean features f_c and passed through TPABs to refine local texture features and compensate for the loss of high-frequency texture details. Specifically, the two chunks of f_o are processed by residual channel attention branches 8 and residual local spatial branches 41, respectively, to extract deep local features. Meanwhile, the features of the different paths are concatenated and fused by a convolutional layer. Lastly, the aggregated local features pass through a GTFB to establish connections between local and non-local features.
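To make the channel split and the two attention branches concrete, the following NumPy sketch shows one possible forward pass. It is only an illustration under several assumptions: the squeeze-and-excitation form of channel attention, the 1 × 1 gating form of spatial attention, the way the clean features are concatenated with the two branches, and all parameter names are hypothetical and may differ from the actual block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_down, w_up):
    """Squeeze-and-excitation style channel attention on x of shape (C, H, W)."""
    squeezed = x.mean(axis=(1, 2))               # global average pooling -> (C,)
    hidden = np.maximum(w_down @ squeezed, 0.0)  # bottleneck + ReLU
    scale = sigmoid(w_up @ hidden)               # per-channel weights in (0, 1)
    return x * scale[:, None, None]

def spatial_attention(x, w_mask):
    """Per-pixel gating: a 1x1 convolution (w_mask of shape (C,)) gives an (H, W) mask."""
    mask = sigmoid(np.tensordot(w_mask, x, axes=(0, 0)))
    return x * mask[None, :, :]

def triple_path_sketch(f_c, f_o, params):
    """Split f_o along channels, run one branch per chunk, fuse with f_c."""
    half = f_o.shape[0] // 2
    chunk_a, chunk_b = f_o[:half], f_o[half:]
    # Residual connections around each attention branch.
    path_a = chunk_a + channel_attention(chunk_a, params["w_down"], params["w_up"])
    path_b = chunk_b + spatial_attention(chunk_b, params["w_mask"])
    stacked = np.concatenate([f_c, path_a, path_b], axis=0)
    # Fusion by a 1x1 convolution: a (C_out, C_in) matrix applied at every pixel.
    return np.tensordot(params["w_fuse"], stacked, axes=(1, 0))
```

The split halves the number of channels each attention branch has to process, which is one reason such a design can add a second local branch without doubling the branch cost.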

Global texture fusion block
Non-local 15,16,42 operations are capable of capturing long-range dependencies between different parts of an image, addressing the limitation of the receptive field by introducing self-attention mechanisms that enable each position to attend to all other positions in the input data. This operation is particularly instrumental in restoring structural textures that exhibit strong self-similarity. Previous researchers 15,42 hypothesized that non-local textures with higher similarity scores would be more advantageous for restoring edge information. However, they overlooked the fact that when an image suffers from severe degradation, non-local textures with low similarity scores may actually be more useful for restoring edges 16.
Fusing the local spatial texture features without careful consideration does not significantly improve the network's ability to restore textures. Therefore, we cascade a global texture fusion block (GTFB) at the end of each TPAG. In this module, we adopt the global learnable attention block 16 after the local feature fusion. The global learnable attention block adaptively adjusts the similarity scores of non-local textures, allowing the network to effectively utilize non-local textures that previously had low similarity scores but can provide rich details.
As shown in Fig. 4, the block takes the feature map X ∈ R^{H×W×C} as input and converts X into three 1D vectors Q, L and V ∈ R^{C×HW} to realize the global attention mechanism. Super-Bit Locality-Sensitive Hashing (SB-LSH) divides the feature map into buckets to reduce computation costs, as shown in Eq. (8),
where M ∈ R^{b×c} is a randomly initialized orthogonal matrix, b is the number of hash buckets, X_i ∈ R^C is the i-th component of Q_i, and i is the index set corresponding to Q_i. Next, we use the learnable similarity score X_l (LSS) and the fixed dot-product similarity score X_f (DPSS) to measure self-similarity as in Eq. (9), where S_f(X_i) = X_i^T X_i and S_l(X_i) is defined as in Eq. (10).
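The idea of hashing features into buckets before computing attention can be sketched as below. This is a simplified illustration, not the SB-LSH formulation itself: the argmax-of-projection hashing rule, the helper names, and the use of only the fixed dot-product score (the learnable score is omitted) are all assumptions made for the example.

```python
import numpy as np

def random_orthogonal_rows(b: int, c: int, rng) -> np.ndarray:
    """Return a (b, c) matrix with orthonormal rows (requires b <= c), via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((c, b)))  # q has orthonormal columns
    return q.T

def hash_buckets(x: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Assign each feature (row of x, shape (N, C)) to the bucket with the largest projection."""
    return np.argmax(x @ m.T, axis=1)

def bucketed_attention(q: np.ndarray, v: np.ndarray, buckets: np.ndarray, n_buckets: int) -> np.ndarray:
    """Softmax attention with dot-product scores, restricted to features in the same bucket."""
    out = np.zeros(v.shape, dtype=np.float64)
    for b in range(n_buckets):
        idx = np.flatnonzero(buckets == b)
        if idx.size == 0:
            continue
        scores = q[idx] @ q[idx].T
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out
```

With N positions spread evenly over b buckets, the attention cost drops from roughly N² score evaluations to about N²/b, which is the motivation for bucketing given in the text.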

Loss function
Our model includes the kernel estimation task and the reconstruction task. We jointly optimize our model using the L1 loss L_kernel and the Charbonnier loss L_pixel, as shown in Eq. (11):

$$L_{total} = L_{kernel} + L_{pixel} \qquad (11)$$

where L_kernel = ||k − k_l||_1 is the L1 loss between the estimated kernel k and the ground-truth blur kernel k_l. The pixel loss is defined as $L_{pixel} = \sqrt{(I_{SR} - I_{HR})^2 + \epsilon}$, where I_SR and I_HR denote the super-resolved image and the ground-truth HR image, and ε is a small constant, usually 1 × 10^-6.
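The joint objective can be written compactly in NumPy. The sketch assumes the standard Charbonnier form with a square root, an unweighted sum of the two terms, and mean reduction over elements; the reduction and the function names are assumptions for illustration.

```python
import numpy as np

def kernel_l1_loss(k_est: np.ndarray, k_gt: np.ndarray) -> float:
    """L1 loss between the estimated and ground-truth blur kernels (mean reduction assumed)."""
    return float(np.abs(k_est - k_gt).mean())

def charbonnier_loss(sr: np.ndarray, hr: np.ndarray, eps: float = 1e-6) -> float:
    """Charbonnier loss: a smooth approximation of L1, sqrt(diff^2 + eps) averaged over pixels."""
    return float(np.sqrt((sr - hr) ** 2 + eps).mean())

def total_loss(k_est, k_gt, sr, hr, eps: float = 1e-6) -> float:
    """Sum of the kernel loss and the pixel loss."""
    return kernel_l1_loss(k_est, k_gt) + charbonnier_loss(sr, hr, eps)
```

With ε = 1 × 10^-6, a perfect reconstruction still yields a pixel loss of 10^-3 per pixel, so the Charbonnier term never reaches exactly zero; its benefit is a well-defined gradient where the plain L1 loss is not differentiable.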

Experiments
Datasets and implementation details
Datasets and metrics
Following previous work 1,2,5, we used DIV2K 50 (800 images) and Flickr2K 51 (2650 images) as the training data, which together contain 3450 2K HR images. We adopt both isotropic and anisotropic Gaussian kernels as the assumed degradation to synthesize the corresponding LR images according to Eq. (1). The experimental results are evaluated using the PSNR and SSIM 52 fidelity metrics, which are calculated only on the Y channel of the YCbCr color space.
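PSNR on the Y channel can be computed as sketched below. The example assumes RGB inputs scaled to [0, 1] and the ITU-R BT.601 luma conversion that is widely used in SR evaluation scripts; border cropping, which some evaluation protocols apply, is left out, and the function names are illustrative.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luma (Y of YCbCr, BT.601) in the range [16, 235] for an (H, W, 3) RGB image in [0, 1]."""
    return 16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR in dB between two RGB images, measured on the Y channel only."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.1 dB corresponds to a reduction in MSE of roughly 2.3%.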

Isotropic Gaussian kernels
In setting 1, isotropic Gaussian kernels are first applied in our study, as in 1-

Anisotropic Gaussian kernels
In setting 2, anisotropic Gaussian kernels were employed in our study, following the work in 1-3,5, During the testing process, the blind SR benchmark DIV2KRK 9 was used for evaluation.
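An anisotropic Gaussian kernel of the kind used in this setting can be generated as follows. This is a sketch, assuming the kernel is defined by two axis widths and a rotation angle; the sampling ranges ([0.6, 5] for the widths, [−π, π] for the angle) follow the training description reported in this paper, while the function names and the uniform sampling are assumptions.

```python
import numpy as np

def anisotropic_gaussian_kernel(size: int, sigma_x: float, sigma_y: float, theta: float) -> np.ndarray:
    """Rotated 2D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    cov = rot @ np.diag([sigma_x ** 2, sigma_y ** 2]) @ rot.T  # rotated covariance
    inv_cov = np.linalg.inv(cov)
    pts = np.stack([xx, yy], axis=-1)                          # (size, size, 2)
    expo = -0.5 * np.einsum("...i,ij,...j->...", pts, inv_cov, pts)
    k = np.exp(expo)
    return k / k.sum()

def sample_training_kernel(size: int, rng) -> np.ndarray:
    """Draw widths and a rotation angle uniformly from the assumed ranges."""
    sigma_x, sigma_y = rng.uniform(0.6, 5.0, size=2)
    theta = rng.uniform(-np.pi, np.pi)
    return anisotropic_gaussian_kernel(size, sigma_x, sigma_y, theta)
```

Setting both widths equal recovers the isotropic case of setting 1, in which the rotation angle has no effect.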

Implementation details
We cropped the training data into sub-images of size 480 × 480 and fed LR patches of size 64 × 64 into our model. Our SR network consists of 6 TPAGs, each consisting of 11 TPABs and 1 GTFB. We trained the model using 8 RTX2070 GPUs, with a batch size of 4 per GPU. The initial learning rate was 1 × 10^-4 and was halved every 2 × 10^5 iterations; the total number of iterations was 1 × 10^6. We used the Charbonnier loss 21 as the loss function and the Adam 53 optimizer with β_1 = 0.9 and β_2 = 0.99. We also adopted horizontal flipping and 90° rotation as data augmentation strategies during the training phase.
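The step schedule and the augmentation described above can be expressed as small helpers. This is a framework-agnostic sketch; the function names, the 50% application probabilities, and the (H, W, C) patch layout are assumptions.

```python
import numpy as np

def learning_rate(iteration: int, base_lr: float = 1e-4, step: int = 200_000) -> float:
    """Step decay: the learning rate is halved every `step` iterations."""
    return base_lr * 0.5 ** (iteration // step)

def augment(patch: np.ndarray, rng) -> np.ndarray:
    """Random horizontal flip and 90-degree rotation of an (H, W, C) patch."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1]       # horizontal flip (reverse the width axis)
    if rng.random() < 0.5:
        patch = np.rot90(patch)      # rotate by 90 degrees in the (H, W) plane
    return np.ascontiguousarray(patch)
```

Over 1 × 10^6 iterations this schedule performs four halvings, so training ends at 1 × 10^-4 / 16 = 6.25 × 10^-6. With 8 GPUs and 4 patches per GPU, the effective batch size is 32. In paired training, the same flip and rotation must be applied to the LR patch and its HR counterpart.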

Evaluation with isotropic Gaussian kernels
We evaluated our method on benchmarks synthesized with Gaussian8 kernels and compared its performance with state-of-the-art blind SR methods, including ZSSR 13, IKC 1, DANv1 5, DANv2 2, AdaTarget 14, KOALAnet 32, and DCLS 3. Additionally, CARN 48, a lightweight non-blind SR model, combined with a blind deblurring 49 method was also implemented for comparison.
The quantitative comparisons on benchmarks with Gaussian8 kernels are shown in Table 1. Our method achieves strong results on various benchmarks, with particularly noticeable gains on datasets with strong self-similarity such as Urban100 46 and Manga109 47, where it improves on DCLS 3 by nearly +0.16 dB and +0.15 dB, respectively, at scale factor ×4. Bicubic interpolation and CARN 48 are non-blind SR methods that assume a known bicubic degradation, which deviates from the actual situation and results in a severe drop in performance. ZSSR 13 utilizes the internal statistics of patch recurrence to build an image-specific super-resolution method that does not require external datasets. This approach improves performance only slightly, owing to the lack of abundant training data and powerful fitting ability. Performing the blind deblurring 49 operation on the reconstructed image can moderately improve performance by reducing artifacts caused by the domain gap. Conversely, applying the operations in the inverse order may further damage details in the LR image, leading to unsatisfactory SR results. IKC 1 and DAN 5 compensate for the offset caused by kernel estimation through iterative correction and end-to-end alternate optimization, respectively, significantly improving the performance. DCLS 3 can retain the spatial information of the blur kernel while introducing dynamic convolution to boost the robustness of estimation, thus achieving superior performance.
Our proposed TPAB compensates for the attenuation of high-frequency components caused by the DCLS 3 deconvolution module, and the GTFB integrates non-local features with low similarity scores to assist in the fusion of local and global features. The qualitative visual results in Fig. 5 also demonstrate that our method is capable of recovering sharp edges and rich details. Furthermore, considering the complexity of actual degradation, we conducted an extra experiment on images degraded with Gaussian8 kernels and additional noise. The quantitative results, shown in Table 2, validate that our method also has a certain degree of robustness to additional noise.
Table 3 shows the quantitative results of these methods on the DIV2KRK 9 dataset. The results indicate that ZSSR 13 can serve as a method for improving on bicubic interpolation. When combined with the kernel estimated by KernelGAN 9 as a prior, the performance of ZSSR 13 is further improved. SRMD 4 performs on par with bicubic interpolation. Classical SR methods such as RCAN 8, EDSR 7, and DBPN 54, which adopted paired training data degraded by bicubic downsampling, suffer an extreme decrease in performance due to the domain gap. The correction filter 55 modifies the blurry image to match the bicubic kernel, significantly improving the performance of DBPN 54 trained on the bicubic kernel.
Among the remaining blind SR methods, which include IKC 1, DAN 2,5, KOALAnet 32, AdaTarget 14, and DCLS 3, our method performed slightly better than DCLS 3. This outcome is consistent with our hypothesis. Due to the wild degradation of the DIV2KRK 9 dataset, the textures and edges are severely damaged. The compensation of the TPAB module for high-frequency features is limited, and GTFB cannot accurately adjust the similarity scores of local textures, so the reconstruction of high-frequency information is not as good as under isotropic Gaussian kernels with mild degradation.

Ablation study and discussion
In this subsection, we performed a series of ablation experiments on the two crucial modules we propose, TPAB and GTFB, to quantitatively study their contributions to our method. The specific settings of the ablation experiments are shown in Table 4. First, DCLS 3 feeds the clean feature f_c together with the original feature f_o into Double Path Attention Groups (DPAG) to reconstruct HR images. DCLS was used as the baseline to explore the function of our proposed modules TPAB and GTFB.
Second, we replaced DPAG with our proposed TPAG, where the original feature f_o was split into two chunks to extract channel and spatial local features that compensate for the high-frequency decline. In this setting, without global feature fusion, the single GTFB was replaced by a TPAB. It can be observed from Table 5 that adding only the TPAB module resulted in a minimal improvement in performance (+0.02 dB on Set14 44 and +0.01 dB on Manga109 47). This may be because the depth of TPAG is already sufficient for extracting degradation features, and using TPAB alone to capture local texture features has limited compensatory effects on high-frequency information.
Lastly, we utilized a variant network consisting of Double Path Attention Blocks (DPAB) and a global texture fusion block to evaluate the contribution of GTFB; that is, we appended a GTFB to each DPAG. The results show a similar trend to the previous experiment, indicating that GTFB could better utilize non-local textures to reconstruct high-frequency details. However, due to the lack of the slight compensation from the TPAB module, there is only a moderate performance improvement (about +0.05 dB on Urban100 46), and the ability to reconstruct texture information was still insufficient.

Performance on real degradation
To further demonstrate the effectiveness of our method, we applied the proposed model (isotropic Gaussian kernels with an additional noise level of 15) to real degraded images, where the degradation is complicated and unknown. Our model was compared with classical real-world super-resolution methods including RealSR 10, BSRGAN 11, Real-ESRGAN 12, DASR 31, and MM-RealSR 56 on the Real20 11 dataset. An example of super-resolving a chip image is shown in Fig. 6. Our method still produces rich details and sharp edges.

Discussion
The specific results of the ablation experiments are shown in Table 5. It is evident that adding either module alone results in only a marginal performance gain (approximately +0.05 dB on Set14 44 and BSD100 45). However, combining the two modules achieves considerably higher performance (+0.16 dB on Urban100 46 and +0.13 dB on Manga109 47 compared with using only one module). One possible reason is that even slight compensation of high-frequency information is crucial for the adaptive adjustment of similarity scores in the global learnable attention 16 block. With the aggregation of local features along both the channel and spatial dimensions introduced by the TPAG module, the GTFB exhibits a stronger ability to fuse global information.

Limitation
Our model has achieved good results in super-resolving images with both synthetic and real-world degradation. However, since our training data only covers blurring and noise, without considering more severe and complicated degradation, our model's performance is not satisfactory when facing images with wild degradation. Meanwhile, due to the dependence on predicting specific kernel parameters, the accuracy of kernel estimation still has a moderate impact on the reconstructed image. We also compared running time and model size with state-of-the-art methods, and the results are shown in Table 6. The global information modeling performed by the GLA 16 module increases the computational cost. In addition, the channel split strategy increases the memory access cost, which is a significant factor affecting inference speed.

Conclusion
In this work, we propose a blind SR network that is capable of combining kernel estimation with structural prior knowledge. Our method consists of two steps: degradation representation embedding and texture details recovery. We first propose a triple path attention block to extract local spatial and channel features that compensate for the loss of high-frequency components caused by the first step. Subsequently, the global texture fusion block is used to fuse local and global textures, thus providing complementary information for the recovery of HR images. A series of experiments on benchmarks with different degradation settings demonstrates that our method achieves outstanding performance in blind SR. In future work, we plan the following main tasks. First, we will utilize contrastive learning to predict the degradation representation of images in order to distinguish different types and levels of degradation, rather than predicting specific kernel parameters. Second, we will explore more practical degradation methods to further generalize the model to real-world images.

Figure 2. The overall architecture of our network and the structure of related blocks. Given an LR image, we first estimate the kernel k and feed it into the DCLS module to achieve degradation representation embedding. The triple path attention groups take the clean feature f_c and the two chunks of the original feature f_o as input to restore the clean SR image.

Figure 3. The overall architecture of dynamic kernel estimation. Given an LR image as input, the estimator first generates four specific filters. Then, these filters are convolved sequentially with an identity kernel I_k to produce a single kernel k with a larger receptive field corresponding to the kernel size.

Figure 4. The details of the global learnable attention 16 block.
9. The kernel size is 11 × 11 and 31 × 31 for scale factors 2 and 4, respectively, in the training stage. During the training process, we randomly sampled the kernel width from the range [0.6, 5] and rotated the kernel by an angle from the range [−π, π].

Table 1. The quantitative results on benchmarks with Gaussian8 kernels. The best two results are marked in bold and italic, respectively.

Table 2. The quantitative comparison on benchmarks with Gaussian8 kernels and various noise levels. The best two results are marked in bold and italic, respectively.
The visual results of sig1.8_img093, sig2.4_img024, sig3.0_img073 in Urban100 46 and sig3.2_YouchienBoueigumi in Manga109 47.

Table 3. The quantitative results on the DIV2KRK benchmark with anisotropic Gaussian kernels. The best two results are marked in bold and italic, respectively.

Table 4. The details of the ablation study. The SR network contains five groups that consist of various numbers of inputs and blocks, depending on whether the channel split strategy is adopted.

Table 5. The ablation study on benchmarks with Gaussian8 kernels. The FLOPs are calculated with an input size of 270 × 180.

Table 6. The comparison of the complexity of different models. The inference latency is tested on an RTX3090 GPU. The FLOPs are calculated with an input size of 270 × 180.